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1 Introduction 

The current common wisdom in the U.S. is that the powerful, cost-effective supercomputers 
of tomorrow will be based on commodity (RISC) micro-processors with cache memories. 
Already, most distributed systems in the world use such hardware as building blocks. This 
shift away from vector supercomputers and towards cache-based systems has brought about a 
change in programming paradigm, even when ignoring issues of parallelism. Vector machines 
require inner-loop independence and regular, non-pathological memory strides (usually this 
means: non-power-of-two strides) to allow efficient vectorization of array operations. Cache- 
based systems require spatial and temporal locality of data, so that data once read from 
main memory and stored in high-speed cache memory is used optimally before being written 
back to main memory (see Bailey ’93 [1]). This means that the most cache-friendly array 
operations are those that feature zero or unit stride, so that each unit of data read from main 
memory (a cache line) contains information for the next iteration in the loop. Moreover, 
loops ought to be ‘fat,’, meaning that as many operations as possible are performed on cache 
data — provided instruction caches do not overflow and enough registers are available. If 
unit stride is not possible, for example because of some data dependency, then care must be 
taken to avoid pathological strides, just as on vector computers. For cache-based systems 
the issues are more complex, due to the effects of associativity and of non-unit block (cache 
line) size (see Bailey ’95 [2]). But there is more to the story. Most modern micro-processors 
are superscalar, which means that they can issue several (arithmetic) instructions per clock 
cycle, provided that there are enough independent instructions in the loop body. This is 
another argument for providing fat loop bodies. 

With these restrictions, it appears fairly straightforward to produce code that will run effi- 
ciently on any cache-based system. It can be argued that although some of the important 
computational algorithms employed at NASA Ames require different programming styles on 
vector machines and cache-based machines, respectively, neither architecture class appeared 
to be favored by particular algorithms in principle. Practice tells us that the situation is more 
complicated. This report presents observations and some analysis of performance tuning for 
cache-based systems. We point out several counterintuitive results that serve as a cautionary 
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reminder that memory accesses are not the only factors that determine performance, and 
that within the class of cache-based systems, significant differences exist. 

2 Kernel code 

The most important flow solver program in use at NASA Ames is OVERFLOW, based on 
an Alternating-Direction Implicit (ADI) algorithm described by Beam and Warming [3]. 
The efficient diagonalized form of the algorithm, described by Pulliam and Chaussee [4], 
is the most commonly used solver kernel of OVERFLOW. Its essence is captured in the 
Scalar Penta-diagonal (SP) program, part of the suite of source codes that make up the 
NAS Parallel Benchmarks 2 (NPB 2) [5]. SP constitutes a stress test on the memory system 
of the computer, since there are fairly few operations per grid point to be executed in any 
one loop of the code. The most critical part of the code is the line solver, which solves 
independent systems of linear equations associated with grid lines. Since there are three 
families of grid lines in three-dimensional space, there are also three different solver routines 
for the three so-called factors. We use these three routines, with the generic names x-loop, 
y-loop, and z-loop, respectively, as examples of codes to be run on cache-based systems. 
They are described in Appendix A. Successive cumulative optimizations attempted in these 
codes, indicated by suffixes 1 through 5, are as follows. 

1. Baseline, which is now contained in NPB 2. 

2. Eliminate temporaries for incremented indices, i.e. i+1, i+2, j+1, j+2, k+1, k+2 
insteadofil, i2, jl, j2, kl, k2. This enables good compilers to do better register 
allocation. 

3. Unroll inner loops of fixed length (m=l,2,3). This reduces loop overhead and register 
demand. 

4. Move the fourth index (m) of the arrays rhs , lhs to the first position. This improves 
spatial data locality, since each iteration of the inner loop uses all three elements of 
rhs and all five elements of lhs at each grid point. 

5. Unroll the first available loop (j for x-loop, i for y-loop and z-loop) to a level of 
two. This increases the number of independent computations in the inner loop. Notice 
the location of the assignment fac2 = 1 .d0/lhs(3,i, j+l,k). It is moved to the front 
of the inner loop, far ahead of the subsequent assignments that make use of f ac2, to 
improve possibilities for optimal scheduling by the compiler. 

In the cases of y-loop and z-loop there are two more code variations possible that have to 
do with the order of the loops. The previous optimizations all employ the so-called canonical 
loop nest orderings for x-loop, y-loop and z-loop, namely running index k for the outer 
loop, j for the middle loop, and i for the inner loop. But it is most natural computationally 
to finish a whole grid line in the inner loop before moving to the next grid line. 
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6. Use j and k as the running index for y-loop and z-loop, respectively, while keeping 
the unrollings described in optimization 5. This causes large strides in the arrays, 
which is generally bad. But the benefit is that there is some overlap in subsequent 
iterations, since there is a recursion in the grid line direction in the solver algorithm. 
The running index for the middle loop in each of the two loop nests is always i. 

7. Use the same natural loop order as in the previous optimization, but eliminate unrolling 
optimization 5. 

Each subroutine is executed for four grid sizes, i.e. 16 3 , 32 3 , 64 3 , and 80 3 . To avoid bad strides 
(see optimizations 6 and 7), all array grid dimensions are padded by 1. Tests with different 
paddings did not produce improvements on the architectures under consideration, and cache 
simulations (see Section 5) indicate that no cache lines remain unused when padding is set 
to 1. Moreover, when the m-index is moved into the first position, primeness of its size (3 for 
rhs and 5 for lhs) effectively guarantees safe strides. An optimization that was attempted 
but that is not reported quantitatively here concerns transposition of the arrays to align the 
line solve direction with the second array index (following m). This technique, which can pay 
off well when many operations are performed at each grid point, such as in FFTs on large 
grids, obtained very poor results for SP due to the sparseness of computations. 

3 Machines 

The machines investigated in this report are mainly RISC processors: MIPS R5000, MIPS 
R8000, MIPS R10000, DEC Alpha EV4, DEC Alpha EV5, IBM RS6000, Sun UltraSparc I. 
The one CISC architecture is the Intel PentiumPro. A number of these processors are cur- 
rently used in parallel platforms: MIPS R8000 in the SGI PowerChallenge Davinci cluster 
at NASA Ames, MIPSR10000 in the SGI 0rigin2000 cluster at the University of Urbana 
Champaign, DEC Alpha EV4 in the Cray T3D Cosmos system at the Jet Propulsion Labo- 
ratory, DEC Alpha EV5 in the Cray T3E machine at the National Energy Research Scientific 
Computing Center in Berkeley, IBM RS6000 in the SP Wide Node Babbage system at NASA 
Ames, and Sun UltraSparc I in the University of Berkeley Network of Workstations (NOW) 
system. 

The PentiumPro is being considered for a new massively parallel system called Whitney, 
currently under construction at NASA Ames. Machine specifics are summarized in Table 
1. They reflect the alterations made to the processors to integrate them in their parallel 
platforms. In particular, the DEC Alpha EV4 and EV5 modified by Cray Research lost 
their off-chip second-level and third-level caches, respectively, for numerical processing. One 
of the most important system parameters, the memory bandwidth, is not listed, because 
vendor-provided data are generally inconclusive. It is assumed that memory bandwidth is 
always an active constraint on processor performance. 

On all machines we select the highest acceptable level of optimization for the Fortran com- 
piler (generally -03). Native Fortran compilers are available for all platforms, except the 
PentiumPro. On the latter system the compiler from the Portland Group is used. 
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4 Measured performance results 


Performances of the processors-and compilers — tested, in MFlop/s (millions of floating point 
computations per second), are presented in Appendix B. They are obtained by running each 
of the loop nests in Appendix A a number of times on each system, ignoring the first iteration 
to eliminate start-up overhead, and determining the minimum of the execution times for the 
remaining iterations to eliminate noise. 

The results for each processor are shown in three different ways, as indicated in the schematic 
in Figure 1. The first display (Figure a) scales the performance within each set of computa- 
tions for a certain grid size and a certain factor (x, y, or z). Maximum performance relative 
to the other optimization strategies for the same grid size and factor is indicated in black, 
minimum in light gray. This type of display allows the comparison of a particular optimiza- 
tion across all grid sizes and factors; for example, if row 3 is black for the y-factor for a 
certain processor (as shown in Figure la), then unrolling of the m-loop, in conjunction with 
optimizations 2, guarantees optimal performance of that y-factor. If rows 3 for the x- factor 
and 2 -factor were black as well, then unrolling the m-loop would be a generally optimal 
strategy for that processor. 

The second display (Figure b) scales the performance within each set of computations for a 
certain factor ( x , y, or z). This allows the easy assessment of the influence of grid size (and 
hence of array size) on performance. For example, if black blocks occur only in the left few 
columns of figure b (as shown in Figure lb), then the processor’s performance apparently 
reduces when the problem grows too large to fit completely in the cache. 

The third display (Figure c) scales the performance within the whole set of computations 
for each processor. This facilitates the combined assessment of cache size and stride on the 
performance. Factor x features the smallest stride, and z the largest, even when the loop 
ordering is theoretically optimal (outer loop k, then j, and inner loop i). This may be borne 
out by the performance results if the maximum performance is obtained for the x-factor, as 
suggested in Figure lc. 

Note that the absolute magnitudes of the performance numbers are the same for correspond- 
ing cases in each set of three figures a, b and c; it is only the shading that differs. Some 


Table 1: Processor specifications summary 


Name 

RAM 

MBytes 

L 1-cache 
KBytes 

L2-cache 

KBytes 

cache line 
Bytes 

associativity 

CPU 

MHz 

Peak 

MFlop/s 

R5000 

64 

32i+32d 

0 

32 

2 

150 

300 

R8000 

4096 

16i+16d 

4096 

512 

4 

90 

360 

R10000 

4096 

32i+32d 

4096 

128 

2 

195 

390 

EV4 

8 

8i+8d 

0 

32 

1 

150 

300 

EV5 

8 

8i+8d 

96 

32 

3 

300 

600 

RS6000 

128 

32i 

1 

256 

256 

4 

66 

267 

PPro 

128 

8i+8d 

256/512 

32 

4 

200 

200 

Sparc 

1024 

16i+16d 

512 

64 

1 

167 

334 
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interesting observations can be made, based on the performance figures. 

MIPS R5000. Figure 2a shows a marked improvement for unrolling the m-loop for the 
z-factor, and, to a lesser extent, for the y- and z-factors. Making m the first array index 
gives a substantial improvement for the y-factor and also some for the z-factor, but reduces 
performance for the z-factor. And, counterintuitively, the z- factor benefits most from the 
natural, large-stride loop ordering, as opposed to the canonical ordering. 

MIPS R8000. Figure 3a shows that unrolling the i-loops for the y- and z-factors greatly 
deteriorates performance, hinting at a problem with the allocation of registers for the fattened 
loop body. Figure 3b clearly demonstrates the reduced processor performance for large grids 
that do not fit completely in cache (sizes 64 and 80). Apparently, the memory bandwidth 
is not sufficient to fill the cache fast enough from main memory. Finally, Figure 3c reveals 
that the generally most cache-friendly z-factor code performs more poorly than the y- and 
z-factors. This may be caused by the interleaved structure of the cache. 

MIPS R10000. Figure 4a shows a strong dependence of problem size on the best optimiza- 
tion strategy. For small grids making m the first array index greatly improves performance, 
but this optimization is completely counterproductive for large grid sizes. In either case, the 
greatest relative performance improvement is obtained by the virtually trivial optimization 
of unrolling the m-loop. Figure 4b again indicates a memory bandwidth problem. Unlike 
for the MIPS R8000, the z-factor now exhibits the best performance again among all three 
factors, as evidenced by Figure 4c. 

IBM RS6000. Figure 5a shows, somewhat surprisingly, that elimination of the auxiliary 
variables il and i2 in the z-factor is beneficial on the RS6000, whereas a similar program 
change for the y- and z-factors has a negative effect. In addition, we notice that the best 
performance for the z-factor is obtained for a partially unrolled y-loop, whereas optimal 
performance for the y- and z-factors is achieved for without unrolling of the i-, j-, and k- 
loops. Finally, moving the m-index proves positive for the x-, neutral for the y-, and negative 
for the z-loop. From Figure 5b we conclude that there is a relatively small effect of grid size 
on performance, indicating that the memory bandwidth is sufficient to feed the cache. 

Intel PentiumPro 200 MHz. Figure 6a demonstrates that there is great sensitivity to 
problem size with respect to the optimizations of the z-factor. But more importantly, Figure 
6c shows that, with proper tuning, a performance of about 20 Mflops/s can be maintained 
for a nontrivial floating point computation on this processor. This places the PentiumPro 
squarely in the scientific workstation performance range. 

DEC Alpha EV4. Evidently, the z-factor performance improves through the unrolling of 
the m-loop, whereas y- and z-factor performances deteriorate under the same code change, 
as evidenced by Figure 7a. It also shows, in contrast with, for example, the RS6000, that 
the best performance is obtained for a partially unrolled j-loop for the z-factor, but for 
completely rolled up i-loops for the y- and z-factors. We also observe in Figure 7c that the 
best results overall are obtained for the z-factor with m as the first array index, w r hich may 
be explained by the small cache of the machine — as modified by Cray Research — and the 
(perceived) good data locality of this code variation. 

DEC Alpha EV5. For this DEC chip with a larger (secondary) cache the performance 
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of the y-factor degrades significantly when switching to the natural loop order, whereas for 
the 2 -factor the effect is mixed. The largest performance improvement for all factors comes 
from moving the m-index to the first array position, with unrolling the i- or j-loops having 
little or no positive effect, as follows from Figure 8a. As for the EV4, the EV5 again shows 
best overall behavior for the x-factor with m as the first array index (Figure 8c). 

Sun UltraSPARC I. From the fairly smooth distribution of performances across columns in 
Figure 9a we deduce that the UltraSparc is not very sensitive to code optimizations by the 
user, except for the very smallest grid size that fits entirely in the cache. Also notice that 
whereas unrolling the i-loop did not help improve the performance of the x-factor, it was 
necessary, combined with natural loop ordering, for best performance of the y- and 2 -loops. 
A quick glance at Figure 9b shows that performance degrades drastically as soon as the 
problem no longer fits entirely in cache, again pointing to a lack of memory bandwidth. 

There are many ways in which the information in Figures 2a through 9a can be summarized. 
We choose a simple one for Table 2. For each processor we count the number of grid sizes 
for which a particular optimization technique is superior over all others. The sum of these 
numbers over all processors is listed as total. The number of processors for which each 
particular optimization is optimal for all grid sizes is listed as unanimous. 


Table 2: Summary of processor performance 


Name 


X‘ 

-factor 




y- 

■factor 





z- 

■factor 




optimizations 


optimizations 




optimizations 



i 

2 

3 

4 

5 

1 

2 

3 

4 

5 

6 

7 

1 

2 

3 

4 

5 

6 

7 

R5000 

0 

0 

0 

0 

4 

0 

0 

0 

0 

3 

1 

0 

0 

0 

0 

0 

0 

4 

0 

R8000 

0 

0 

4 

0 

0 

0 

0 

2 

0 

0 

0 

2 

0 

0 

2 

0 

0 

0 

2 

R10000 

0 

0 

2 

1 

1 

0 

0 

2 

0 

0 

2 

0 

0 

0 

2 

0 

0 

2 

0 

RS6000 

0 

0 

0 

0 

4 

0 

0 

0 

0 

0 

0 

4 

0 

0 

0 

0 

0 

1 

3 

PentiumPro 

0 

0 

0 

4 

0 

0 

0 

0 

4 

0 

0 

0 

0 

0 

0 

0 

0 

0 

4 

DEC EV4 

0 

0 

0 

4 

0 

0 

0 

0 

0 

0 

4 

0 

0 

0 

0 

0 

0 

4 

0 

DEC EV5 

0 

0 

0 

4 

0 

0 

0 

0 

0 

4 

0 

0 

0 

0 

0 

0 

2 

0 

2 

UltraSparc 

0 

0 

0 

4 

0 

0 

0 

0 

0 

4 

0 

0 

0 

1 

0 

0 

0 

3 

0 

total 

0 

0 

6 

17 

9 

0 

0 

4 

4 

11 

7 

6 

0 

1 

4 

0 

2 

14 

11 

unanimous 

0 

0 

1 

4 

2 

0 

0 

0 

1 

2 

1 

1 

0 

0 

0 

0 

0 

2 

1 


Evidently, there is not a single optimization strategy that gives the best performance for 
all grid sizes for all processors. But even if we restrict the attention to one processor at a 
time, most still do not allow a single uniform optimization strategy. Only the PentiumPro 
and the DEC Alpha EV4 chip show consistently best performance (within 1% accuracy) for 
fixed strategies x-4, y-4, 2-7 and x-4, y-6, z-6, respectively. If we rule out unrolling the i, 
j, or k loops ( y-5,6 , z-5,6), then the generally most acceptable single optimization strategy 
is: x-4, y-7, z-1 , i.e. the canonical loop ordering, with large strides for the y- and 2 -factors, 
and m as the first array index. We notice that the scatter in the optimal tuning strategies is 
substantially larger for the y- and 2 -factors than for the x-factor. 
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5 Cache simulations 


It follows from the results in the previous section that intuition is a poor guide to performance 
tuning. More detailed knowledge and insight are needed to design a good optimization 
strategy. In order to produce code with portable performance characteristics, we limit the 
type of knowledge to enter into the investigation to relatively easily quantifiable parameters 
on arbitrary systems. Specifically, we choose to simulate the cache behavior of the same 
set of processors as before to understand better the observed computational results. The 
model for the simple cache simulator is as follows. The highest-level cache is assumed to 
constitute the memory bottleneck; its parameters are used for the simulator. In case of 
set-associative caches a Least Recently Used (LRU) replacement policy of cache lines is 
employed. No knowledge about the cache other than line length, number of lines, and 
associativity, is incorporated; For example, the effects of cache memory interleaving, and 
of sophisticated strategies such as early restart and requested word first, are not taken into 
account. Moreover, we assume a separate data cache, since unified caches require vastly more 
information for simulation, including the assembler version of the code. We simply transform 
the assignments in the source code by hand to calls to the cache simulation routines. Only 
memory loads are modeled. The structure of the computational loops is such that writes 
are always done to memory locations that are already in cache, which means that no cache 
lines need to be flushed to accommodate write operations. We ignore the effects of writing 
through the cache to main memory. This is a potentially serious simplification. Depending 
on the buffering strategy and the implementation of the cache controller, substantial stalling 
may be incurred due to write operations. 

Results of the simulations are presented in Appendix C. Each figure shows the percentage 
of memory accesses (fetches) that cause cache misses for the optimizations described in 
Appendix A (all code fragments exhibit the same total number of array references). Tallying 
cache hits and misses starts after each loop nest has been executed once. This fills the cache 
with (potentially) useful data beforehand, which is equivalent to the preconditioning in the 
actual performance measurements described in Section 4. Note that optimization strategies 
x-2, x-3, y-2, y-3, z-2 and z-3 are not displayed, since these feature exactly the same memory 
access patterns as x-1, y-1 and z-1, respectively. Shading is applied in the same fashion as 
in Figures a of Appendix B. The darker the shading, the fewer cache misses. Since we expect 
the percentage of cache misses to be smaller for better performing code fragments, we call 
this relation between the two a positive correlation. 

MIPS R5000. Figure 10 exhibits a positive correlation between performance and cache 
misses for the x- and 2 -factors, but a (mostly) negative one for y. We also notice that 
there is a large difference in the performance of optimizations x-1 through x-3, although 
the memory reference patterns are identical. Finally, we observe that the number of cache 
misses for the 2 -factor is slashed in three — somewhat unexpectedly — by using the natural 
loop ordering. 

MIPS R8000. Figure 11 features a very small miss rate (less than 1%), and virtually no 
correlation between performance and cache misses. 

MIPS R10000. Again there is very poor correlation between processor performance and 
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miss rate (Figure 12), and in the case of a grid size of 32 3 points the correlation is even 
negative. 

IBM RS6000. As in the case of the MIPS R5000, the number of cache misses for the 
z-factor is reduced significantly by reverting to the natural loop order for large grids. Figure 
13 indicates a fair correlation between performance and cache misses overall. 

Intel PentiumPro 200 MHz. Although there is a clear positive correlation between cache 
misses and performance for the z-factor, Figure 14 reveals a negative and mixed correlation 
for the y- and z-factors, respectively. And again, natural loop ordering benefits the memory 
access pattern of the z-factor. 

DEC Alpha EV4. Figure 15 confirms a positive correlation between cache misses and 
performance for the z-factor, and negative and mixed correlations for z and y. It is of some 
interest to note the erratic simulated cache behavior for the y-factor, with y- 7. y-1, y-7 and 
y-6 being the optimal strategies for grid sizes 16 3 , 32 3 , 64 3 and 80 3 , respectively. 

DEC Alpha EV5. Figure 16 points out that reductions of cache misses by up to a factor 
of 3 for the larger grids of the z-factor by switching to the natural loop ordering has no effect 
or no positive effect on the performance. The y-factor shows widely varying performance 
figures for identical numbers of cache misses. 

Sun UltraSPARC I. Performance of the UltraSparc for small grids (16 3 ) varies significantly, 
especially for the z-factor, although the problem fits entirely in the cache, and no misses 
occur. For larger grids the number of cache misses varies widely for the z-factor, with little 
difference in processor performance. 

Evidently, the simulated cache behavior for the processors studied correlates poorly with 
the actually observed performance. It can be argued that this is due to the limitations of 
the simulator, and that better correlations can be obtained by incorporating more detailed 
characteristics of the particular machines. But since we are interested in producing portable 
code, introducing even more information into the program construction will make this task 
virtually impossible. We also notice that there is significant — and nontrivial — dependence 
of the processor performance on the problem size, which is generally not known at compile 
time. This makes the task of writing portable codes even harder. Moreover, the cache miss 
rate of the simple piece of code investigated in this paper is a complex function of loop 
organization, problem size, and cache structure, even if only very few parameters are used 
to describe the cache. 

6 Conclusions 

We conclude that tuning codes for high performance on commercial cache-based processors 
available today is a process that requires so much knowledge and information about the 
entire configuration of user program, problem parameters, system software, and hardware 
organization, that it is practically impossible to produce ‘good cache code’. By this we mean 
that it is not possible to tell whether a piece of numerical software will perform well by 
using textual inspection and a few high-level parameters of the system under consideration; 
General optimization strategies designed to take advantage of cache do not alone yield high 
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performance. Other factors, such as software pipelining, need to be considered as well when 
designing efficient codes, but this will hamper portability of codes between architectures. 

Finally, it should be noted that extreme care must be taken to extend the optimization 
results obtained from the loops in Appendix A to entire application programs in case data 
lay-outs are changed. For example, it may appear natural to move the m-index to the front 
of the rhs and lhs arrays to increase cache data reuse. But for loops in which only, say, the 
first element of rhs is used, performance may suffer, due to ‘gaps’ in the array. 
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Appendix A: Codes 


Below follow the source listings of the SP solver fragments used in the performance tests; 
only the forward elimination parts of the penta-diagonal line solvers are used. Diagonal- 
ization makes it possible to compute three solution values (rhs(. . ,m=l,3)) per grid point 
simultaneously from a single penta-diagonal matrix (Ihs). For each of the three factors to 
be inverted in the ADI scheme (corresponding to the x-, j/-, and 2 -directions), the same opti- 
mizations are performed for corresponding suffixes. For example, x-3 , y-3 and z-3 all unroll 
inner loops of length three, x-1 , y-1 and z-1 are the actual code fragments used in NPB 2. 
Here we only show x-1 through z-5, y-6, and y-7. The other code fragments are easily in- 
ferred. Italic typeface is used to indicate which parts of the code fragments are affected by the 
optimizations. The dimensions of the arrays are rhs(nx,ny,nz,3) and lhs(nx,ny ,nz,5), 
respectively, when m is the last index, and rhs(3,nx,ny,nz) and lhs(5,nx,ny,nz), respec- 
tively, when m is the first index. 


x-1: NPB 2.2 code 

do k - 1 , nz 
do j = 1, ny 

do i = 1 , nx-2 

11 = i +1 

12 = i +2 

facl = l.d0/lhs(i,j,k,3) 

lhs(i,j,k,4) » facl*lhs(i , j ,k,4) 

Ihs (i , j ,k, 5) * facl*lhs(i, j ,k,5) 

do m = 1 , 3 

rhs(i, j ,k,m) = f acl*rhs (i , j ,k,m) 
end do 

lhs(il , j ,k,3) = lhs(il , j ,k,3) - lhs(il , j ,k,2)*lhs(i , j ,k,4) 
lhs(il , j ,k,4) = lhs(il , j ,k,4) - lhs(il , j ,k,2)*lhs(i , j ,k,5) 
do m = 1 , 3 

rhs(il , j ,k,m) = rhs (il , j ,k,m) - lhs(il , j ,k,2)*rhs(i, j ,k,m) 
end do 

lhs(i2, j ,k,2) = lhs(i2, j ,k,2) - lhs(i2, j ,k, l)*lhs(i , j ,k,4) 
lhs(i2, j ,k,3) = lhs(i2, j ,k,3) - lhs(i2, j ,k,l)*lhs(i , j ,k,5) 
do m = 1, 3 

rhs(i2, j ,k,m) = rhs (12 , j ,k,m) - lhs(i2, j ,k, l)*rhs(i, j ,k,m) 
end do 
end do 
end do 
end do 

x-2 : do not use auxiliary variables for incremented indices 

do k = 1 , nz 
do j = 1, ny 

do i = 1 , nx-2 

facl = 1 .dO/lhs(i, j ,k,3) 
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lhs (i , j ,k,4) = facl*lhs(i , j ,k,4) 

lhs(i , j »k , 5) = facl*lhs(i, j ,k,5) 

do m = 1 , 3 

rhs(i,j,k,m) = f acl*rhs (i , j ,k,m) 
end do 

lhs (i+1 ,k,3) = lhsd+1, j ,k,3) - lhs(i+l, j ,k,2)*lhs(i , j ,k,4) 
lhstt+1* j ,k,4) = lhs(i+l t j ,k,4) - lhs(i+l t j ,k,2)*lhs(i , j ,k,5) 
do m = 1 , 3 

rhs(i+l, j ,k,m) = rhs(«>i, j ,k,m) - lhs(i+l 9 j ,k,2)*rhs(i, j ,k,m) 
end do 

lhs(i+2, j ,k,2) = lhsd+2» j ,k,2) - Ihs(i+2> j ,k, l)*lhs(i, j ,k,4) 
lhs(i+2 ,k,3) = lhs(i+2> j ,k, 3) - lhs(i+2, j ,k, 1) *lhs (i , j ,k,5) 
do m = 1 , 3 

rhs(i+2, j ,k,m) = rhs (i+2> j ,k,m) - lhs(z-f^, j ,k, 1) *rhs(i , j ,k,m) 
end do 
end do 
end do 
end do 

x-3: unroll small inner loops of length 3 
do k = 1 , nz 
do j = 1, ny 

do i = 1 , nx-2 

facl = 1 ,dO/lhs(i, j ,k,3) 

lhs(i,j,k,4) = facl*lhs(i, j ,k,4) 

lhs(i,j,k,5) = facl*lhs(i, j ,k,5) 

rhs(i,j,k,l) = fael*rhs(i,j,k,l) 
rhs(i,j,k,2) - facl*rhs(i,j,k,2) 
rhs(i,j } k,3) — facl*rhs(i,j,k,3) 

lhs(i+l, j ,k,3) = lhs(i+l, j ,k,3) - lhs(i+l,j ,k,2)*lhs(i, j ,k,4) 
lhs (i+1 » j ,k,4) = lhs (i+1 , j ,k,4) - lhs(i+l , j ,k,2)*lhs(i , j ,k,5) 

rhs(i+l,j,k,l) - rhs(i+l,j,k,l) - lhs(i+l,j,k J 2)*rhs(i,j,k,l) 
rhs(i+l,j,k,2) = rhs(i+l,j,k,2) - lhs(i+l,j,k,2)*rhs(i,j,k,2) 
rhs(i+l,j,k,3) == rhs(i+l,j t k,3) - lhs(i+l,j,k,2)*rhs(i,j,k,3) 
lhs (i+2 , j ,k,2) = lhs(i+2, j ,k,2) - lhs(i+2, j ,k, l)*lhs(i, j ,k,4) 
lhs(i+2, j ,k,3) = lhs (i+2 ,j ,k,3) - lhs(i+2, j ,k, l)*lhs(i , j ,k,5) 
rhs(i+2,j,k,l) = rhs(i+2,j,k,l) - lhs(i+2,j,k,l)*rhs(i,j,k,l) 
rhs(i+2,j,k,2) = rhs(i+2,j,k,2) - lhs(i+2,j,k,l)*rhs(i,j,k,2) 
rhs(i+2,j,k,3) = rhs(i+2 7 j,k,3) - lhs(i+2,j,k,l)*rhs(i 7 j 7 k,3) 
end do 
end do 
end do 

x-4- move last index of rhs and lhs to the front 

do k = 1 , nz 
do j = 1, ny 

do i = 1, nx-2 
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facl 

lhs(^,i, j ,k) 
lhs (5, i , j ,k) 
rhs ( i , i , j , k) 
rhs (£, i , j , k) 
rhs(3,i, j ,k) 
lhs(3,i+l , j »k) 
lhs (4 , i + l , j ,k) 
rhs (i , i+1 , j »k) 
rhs(2,i+l , j ,k) 
rhs (5, i+l , j ,k) 
lhs(2,i+2, j ,k) 
lhs(3,i+2, j ,k) 
rhs(i,i+2, j ,k) 
rhs(£,i+2, j ,k) 
rhs(5,i+2, j ,k) 
end do 
end do 
end do 


1 .d0/lhs(3, i , j ,k) 
facl^lhsC^i, j ,k) 
facl*lhs(5,i, j ,k) 
facl+rhsC^ji, j ,k) 
f acl*rhs(i?,i , j ,k) 
facl*rhs(3,i, j ,k) 

lhs(5,i+l,j,k) - lhs(2,i+l , j ,k)*lhs(^,i, j ,k) 
lhs(^,i+l, j ,k) - lhs(£,i+l , j ,k)*lhs(5,i, j ,k) 
rhs(i , i+l , j ,k) - lhs(£,i+l, j ,k)*rhs(i,i, j ,k) 
rhs(2, i+l , j »k) - lhs(#,i+l, j ,k)*rhs(2,i, j ,k) 
rhs(5, i+l , j ,k) - lhs(£,i+l , j ,k)*rhs(3,i, j ,k) 
lhs(2,i+2, j ,k) - lhs(l,i+2, j ,k)*lhs(^,i, j ,k) 
lhs(3,i+2, j ,k) - lhs ( 1 , i+2 , j ,k) *lhs (5, i , j ,k) 
rhs(i,i+2, j ,k) - lhs(i,i+2, j ,k)*rhs(i, i , j ,k) 
rhs(2,i+2, j ,k) - lhs(i,i+2, j ,k)*rhs(£,i, j ,k) 
rhs(3,i+2, j ,k) - lhs ( 1 , i+2 , j ,k) *rhs (5, i , j ,k) 


x-5: unroll j-loop to a level of 2 

do k = 1 , nz 

do j = 1, ny, 2 
do i = 1 , nx-2 

facl = 

fac2 

lhs(4,i , j ,k) 
lhs (5 , i , j ,k) = 

rhs (1 , i , j ,k) 
rhs (2 , i , j ,k) 
rhs (3 , i , j ,k) 
lhs(3,i+l , j ,k) * 

lhs (4 , i+l , j ,k) 
rhs(l , i+l , j ,k) 
rhs(2,i+l , j ,k) 
rhs(3,i+l , j ,k) 
lhs(2,i+2, j ,k) = 

lhs(3,i+2, j ,k) 
rhs(l , i+2 , j ,k) 
rhs(2,i+2, j ,k) = 

rhs(3,i+2, j ,k) = 

lhs(4 , i ,j+l ,k) - 

lhs(5 ,i ,j+l ,k) = 

rhs(l,i, j-f i,k) 
rhs (2 ,i,j+l 9 k) = 

rhs(3,i,j-f i,k) - 
lhs(3,i+l,j^,k) = 


l.d0/lhs(3,i, j ,k) 

1 . dO/lhs (3 , i , 7 ’-/-! ,k) 
facl*lhs(4,i, j ,k) 
facl*lhs(5,i, j ,k) 
f acl*rhs(l ,i, j ,k) 
facl*rhs(2,i, j ,k) 
facl*rhs(3,i, j ,k) 

lhs(3,i+l, j ,k) - lhs(2,i+l, j ,k)*lhs(4,i , j ,k) 
lhs(4,i+l , j ,k) - lhs(2,i+l, j ,k)*lhs(5,i , j ,k) 
rhs(l , i+l , j ,k) - lhs(2,i+l , j ,k)*rhs(l ,i , j ,k) 
rhs(2,i+l , j ,k) - lhs(2,i+l , j ,k)*rhs(2,i , j ,k) 
rhs(3,i+l, j ,k) - lhs(2,i+l , j ,k)*rhs(3,i , j ,k) 
lhs(2,i+2, j ,k) - lhs(l,i+2, j ,k)*lhs(4,i, j ,k) 
lhs(3,i+2, j ,k) - lhs(l,i+2, j ,k)*lhs(5,i, j ,k) 
rhs (1, i+2, j ,k) - lhs(l , i+2 , j ,k) *rhs(l , i , j ,k) 
rhs(2,i+2, j ,k) - lhs(l ,i+2, j ,k)*rhs(2,i , j ,k) 
rhs(3,i+2, j ,k) - lhs(l ,i+2, j ,k)*rhs(3,i , j ,k) 
f ac2*lhs(4,i , j+l,k) 
f ac2*lhs(5,i , j+l,k) 
f ac2*rhs(l , i ,j+l, k) 
f ac2*rhs(2,i f j+l,k) 
f ac2*rhs(3,i $ j+ !,k) 

lhs (3, i+l >j+l > k) - lhs (2, i+l f j+l ,k) *lhs(4,i ,j-/-i,k) 
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lhs (4, i+1 ,j+l ,1s.) = lhs(4,i+l , j+1, Is) 
rhs(l,i+l,j-/-i,k) = rhs(l , i+1 ,j+l ,k) 
rhs (2, i+1 ,j+l ,k) = rhs(2 , i+1 ,j+l ,k) 
rhs(3,i+l,j-/-2,k) = rhs(3,i+l ,j+l ,k) 
lhs(2,i+2,j>/,k) = lhs(2,i+2,j-f 1 ,k) 
lhs (3, i+2 , j+1 ,1s) = lhs(3,i+2,ji L i,k) 
rhs(l ,i+2,j-/-/,k) = rhs( 1 , i+2 ,j+l ,k) 
rhs(2,i+2,j-/-l,k) = rhs(2 , i+2 , j+ 1 ,k) 
rhs(3,i+2,j-/'i,k) = rhs (3 , i+2 , j+ 1 ,k) 
end do 
end do 
end do 

y-6: canonical loop order , i-loop unrolled to level 2 

do k = 1 , nz 

do i 8 1, nx, 2 

do j = 1, ny-2 


lhs (2, i+1 ,j+ 1 ,k)*lhs(5,i ,j+l ,k) 
lhs (2, i+1 ,j+l ,k) *rhs (1 ,± ,j+l ,1s) 
lhs (2, i+1 J+1, k) *rhs (2 , i ,j+l ,k) 
lhs (2, i+1 , j+1 ,k)*rhs(3 ,i , j+1 ,1s) 
lhs (1 , i+2, j+1 ,k) *lhs(4, i ,j+l ,k) 
lhs(l , i+2, j+1 ,1s)*lhs(b,i J+1 ,1s) 
lhs ( 1 , i+2 , j+1 ,k) *rhs ( 1 , i , j+1 , k) 
lhs ( 1 , i+2 , j+ 1 ,k) *rhs (2 , i ,j+l , k) 
lhs ( 1 , i+2 ,j+l , k) *rhs (3 , i ,j+l ,k) 


facl = 1 .dO/lhs (3 , i , j ,k) 

f ac2 = 1 .dO/lhs (3 , i+1 J ,1s) 

lhs(4,i,j,k) = f acl*lhs(4, i, j ,k) 

lhs(5,i,j,k) = facl*lhs(5,i, j ,k) 

rhs(l,i,j,k) = facl*rhs(l,i, j ,k) 

rhs(2,i,j,k) = f acl*rhs (2,i , j ,k) 

rhs(3,i,i.k) = f acl*rhs (3*i . i ,k) 


lhs(3,i, j+1 ,k) = lhs(3,i, j+1 , k) - 
lhs(4,i, j+l,k) = lhs (4 , i , j+1 ,k) - 
rhs(l,i, j+l,k) = rhs(l , i , j+1 ,k) - 
rhs(2,i, j+l,k) = rhs(2 , i , j+1 ,k) - 
rhs(3,i, j+l,k) = rhs(3 , i , j+1 ,k) - 
lhs(2,i, j+2,k) = lhs(2,i, j+2,k) - 
lhs(3,i, j+2,k) = lhs(3,i, j+2,k) - 
rhs(l,i, j+2,k) = rhs(l ,i , j+2,k) - 
rhs(2,i, j+2,k) = rhs (2 , i , j+2 ,k) - 
rhs(3,i, j+2,k) = rhs (3, i , j+2 ,k) - 
lhs(4, i+1 , j ,k) = fac2*lhs(4,zi L i, j 
lhs(5,i-f 1, j ,k) = fac2*lhs(5,«>i, j 
rhs(l,i-/-l, j ,k) = f ac2*rhs(l,z>i, j 
rhs(2, i+1 , j ,k) = f ac2*rhs(2 , i+1 , j 
rhs(3, i+1 , j ,k) = f ac2*rhs(3, i+ 1, j 
lhs(3 , i+1 , j+1 ,k) = lhs(3,z-/-i, j+l,k) 
lhs(4 , i+1 , j+1 ,k) = lhs (4, i+1, j+1 ,k) 
rhs(l , i+1 , j+l,k) = rhs(l , i+1, j+1 ,k) 
rhs(2, i+1 , j+l,k) = rhs(2,i-/-i, j+l,k) 
xhs(3 , i+1 , j+1 ,k) = rhs (3 ,i+l , j+1 ,k) 
lhs(2, i+1 , j+2,k) = lhs(2, i+1 , j+2,k) 
lhs(3,i^ , j+2,k) = lhs(3,i-/-i, j+2,k) 


lhs(2,i , j+1 ,k)*lhs(4,i , j ,k) 
lhs (2, i , j+1 ,k)*lhs(5,i , j ,k) 
lhs (2, i , j+1 ,k) *rhs(l ,i , j ,k) 
lhs(2,i, j+1 ,k)*rhs(2,i , j ,k) 
lhs(2,i , j+1 ,k)*rhs(3,i , j ,k) 
lhs(l , i, j+2,k)*lhs(4,i , j ,k) 
lhs(l ,i, j+2,k)*lhs(5,i , j ,k) 
lhs(l , i, j+2,k)*rhs(l , i , j ,k) 
lhs(l ,i, j+2,k)*rhs(2,i , j ,k) 
lhs(l , i, j+2,k)*rhs(3,i , j ,k) 

»k) 

»k) 

,k) 

»k) 

- lhs(2,«-/-i,j+l,k)*lhs(4,z-/-^,j ,k) 

- lhs(2 , i+1 , j+1 ,k)*lhs(5,i-/-/, j ,k) 

- lhs(2,i-/-i, j+1 ,k)*rhs(l ,i+l , j ,k) 

- lhs(2,i+l , j+l,k)*rhs(2,i-fi,j ,k) 

- lhs(2, i+1 , j+1 ,k) *rhs (3 , i+ 1 , j ,k) 

- lhs(l,i-f 1 , j+2,k)*lhs(4,i-fi, j ,k) 

- lhs(l , i+1 , j+2,k)*lhs(5,i-/-i , j ,k) 
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rhs(l, i+1, j+2,k) = rhs(l, z-fl,j+2,k) - lhs(l ,i+l, j+2 ,k)*rhs(l , i+1 , j ,k) 
rhs(2,i-/-l, j+2,k) = rhs(2 , i+1 , j+2 ,k) - lhs(l ,i+l, j+2 ,k)*rhs(2 , i+1 , j ,k) 
rhs(3, i-fl, j+2, k) = rhs(3, i-fl, j+2, k) - lhs(l ,i+l, j+2 ,k)*rhs(3 , i+1 , j ,k) 

end do 
end do 
end do 

y-7: canonical loop order , rolled-up i-loop 
do k = 1 , nz 
do i = 1 , nx 
do j = 1, ny-2 

facl = 1 .d0/lhs(3,i, j ,k) 

lhs(4,i, j ,k) = facl*lhs(4,i» j .k) 

lhs (5 , i , j ,k) = facl*lhs(5,i, j ,k) 

rhs(l ,i, j ,k) = facl*rhs(l,i, j ,k) 

rhs(2,i,j,k) = facl*rhs(2,i, j ,k) 
rhs(3,i, j ,k) = faci*rhs(3,i, j ,k) 

lhs(3,i, j+l,k) = lhs(3,i,j+l,k) - lhs(2,i,j+l,k)*lhs(4,i.j »k) 
lhs(4,i, j+l,k) = lhs(4,i, j+l,k) - lhs(2,i,j+i,k)*lhs(5,i,j .k) 
rhs(l,i,j+l,k) = rhs(l ,i, j+1 ,k) - lhs(2, i , j+1 ,k)*rhs(l , i , j ,k) 
rhs(2,i, j+l,k) = rhs(2,i, j+l,k) - lhs(2,i, j+l,k)*rhs(2,i, j ,k) 
rhs(3,i, j+1 ,k) = rhs(3,i. j+1 ,k) - lhs(2,i, j+1 ,k)*rhs(3,i,j ,k) 
lhs (2, i, j+2, k) = lhs (2, i, j+2, k) - lhs(l ,i, j+2,k)*lhs(4,i , j ,k) 
lhs(3,i , j+2 ,k) = lhs (3, i, j+2, k) - lhs(l,i, j+2,k)*lhs(5,i, j ,k) 
rhs(l,i,j+2,k) = rhs(l,i, j+2,k) - lhs(l,i,j+2,k)*rhs(l,i,j ,k) 
rhs(2,i, j+2,k) = rhs(2,i, j+2,k) - lhs(l,i, j+2,k)*rhs(2,i, j ,k) 
rhs(3,i, j+2 ,k) = rhs(3,i, j+2 ,k) - lhs(l,i, j+2,k)*rhs(3,i, j ,k) 

end do 
end do 
end do 
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a) Grid size 
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c) Grid size 

16 32 64 80 

X-1 22.41 22.43 17.22 17.27 
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Figure 3: Performance (Mflop/s) for MIPS R8000 (SGI PowerChallenge) 
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Figure 4: Performance (Mflop/s) for MIPS R.10000 (SGI 0rigin2000) 
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Figure 7: Performance (Mflop/s) for DEC Alpha EV4 (Cray T3D) 


a) Grid size 
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Figure 8: Performance (Mflop/s) for DEC Alpha EV5 (Cray T3E) 
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Appendix C: Simulated cache performance 
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Figure 10: Cache 

misses (%), MIPS R5000 

Figure 11: Cache misses (%), MIPS R8000 
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Figure 16: Cache misses (%), DEC EV5 



Figure 17: Cache misses (%), UltraSparc 

















